0.1 Setting up the environment
library(readr)
library(tidyverse)
library(h2o)
library(car)
library(highcharter)
library(boot)
source("/Users/deyapple/Documents/Courses/Term04/DA/Homework/hw_07/images/unbalanced_functions.R")
mdata <- read_csv("/Users/deyapple/Documents/Courses/Term04/DA/Homework/hw_07/images/murder_suicide.csv")
1 Q1
1.1 Choosing A Non-redundant Subset
murder <- mdata %>%
select(ResidentStatus,
Education = Education2003Revision,
Sex,
Age = AgeRecode52,
PlaceOfDeathAndDecedentsStatus,
MaritalStatus,
MethodOfDisposition,
MannerOfDeath,
Autopsy,
ActivityCode,
PlaceOfInjury,
Race = RaceRecode5)
After selecting a subset of variables, I looked at all the variables containing unknown values. Based on the number of unknown observations, I either deleted those observations or replaced them with the mean.
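The audit of unknown values described above can be sketched as one pass over the variables. The `count_unknown` helper and the toy data frame below are illustrative only (not part of the original analysis); with the real data you would pass `murder` and the dataset's actual sentinel codes.

```r
# Hypothetical helper: count the sentinel "unknown" codes per variable,
# so we can decide per variable whether to drop rows or impute the mean.
count_unknown <- function(df, codes) {
  sapply(names(codes), function(v) sum(df[[v]] == codes[[v]], na.rm = TRUE))
}

# Toy data standing in for `murder`; 9 and "U" are the unknown codes here.
toy <- data.frame(Education = c(3, 9, 9, 6),
                  MaritalStatus = c("M", "U", "S", "U"))
count_unknown(toy, list(Education = 9, MaritalStatus = "U"))
```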
1.2 Unknown Values
1.2.1 Unknown Education
The number of unknown observations is high, therefore instead of deleting them I've replaced them with the mean of the known values.
nrow(murder %>%
filter(Education == 9))
[1] 1162
meanedu <- mean(murder %>%
filter(Education != 9) %>%
select(Education) %>%
unlist() %>%
unname())
murder <- murder %>%
mutate(Education = ifelse(Education == 9, round(meanedu), Education))
murder$Education = as.integer(murder$Education)
1.2.2 Unknown Age
nrow(murder %>%
filter(Age == 52))
[1] 16
murder <- murder %>%
filter(Age != 52)
1.2.3 Unknown PlaceOfDeathAndDecedentsStatus
nrow(murder %>%
filter(PlaceOfDeathAndDecedentsStatus == 9))
[1] 162
murder <- murder %>%
filter(PlaceOfDeathAndDecedentsStatus != 9)
1.2.4 Unknown MaritalStatus
nrow(murder %>%
filter(MaritalStatus == "U"))
[1] 630
murder$MaritalStatusFactor = as.integer(as.factor(murder$MaritalStatus))
meanm <- mean(murder %>%
filter(MaritalStatusFactor != 4) %>%
select(MaritalStatusFactor) %>%
unlist() %>%
unname())
murder <- murder %>%
mutate(MaritalStatus = ifelse(MaritalStatus == "U", "M", MaritalStatus),
MaritalStatusFactor = ifelse(MaritalStatusFactor == 4, round(meanm), MaritalStatusFactor))
1.2.5 Unknown MethodOfDisposition
nrow(murder %>%
filter(MethodOfDisposition == "U"))
[1] 5627
murder$MethodOfDispositionFactor = as.integer(as.factor(murder$MethodOfDisposition))
meand <- mean(murder %>%
filter(MethodOfDispositionFactor != 7) %>%
select(MethodOfDispositionFactor) %>%
unlist() %>%
unname())
murder <- murder %>%
mutate(MethodOfDisposition = ifelse(MethodOfDisposition == "U", "C", MethodOfDisposition),
MethodOfDispositionFactor = ifelse(MethodOfDispositionFactor == 7, round(meand), MethodOfDispositionFactor))
1.2.6 Not Applicable Activity Code
nrow(murder %>%
filter(ActivityCode == 99))
[1] 1143
murder <- murder %>%
mutate(ActivityCode = ifelse(ActivityCode == 99, 9, ActivityCode))
1.3 Correlation Matrix
Filter(is.numeric, murder) -> murder.num
hchart(cor(murder.num))
library(reshape2)  # needed for melt()
library(scales)    # needed for muted()
murder.num.cor <- cor(murder.num, method = "spearman")
murder.cor.melt = melt(murder.num.cor)
ggplot(murder.cor.melt, aes(Var1, Var2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(fill = value, label = round(value, 2)), size = 5, color = "black") +
scale_fill_gradient2(low = muted("gold"),
mid = "mistyrose",
high = muted("mediumvioletred"),
midpoint = 0) +
theme(panel.grid.major.x=element_blank(),
panel.grid.minor.x=element_blank(),
panel.grid.major.y=element_blank(),
panel.grid.minor.y=element_blank(),
panel.background=element_rect(fill="white"),
axis.text.x = element_text(angle = 90, hjust = 1,vjust=1,size = 15,face = "bold"),
plot.title = element_text(size=20,face="bold"),
axis.text.y = element_text(size = 15,face = "bold")) +
ggtitle("Correlation Plot") +
theme(legend.title=element_text(face="bold", size=14)) +
scale_x_discrete(name="") +
scale_y_discrete(name="") +
labs(fill="Corr. Coef.")
1.4 Scatter Plot Matrix
scatterplotMatrix(murder.num.cor)
2 Q2
2.1 Studying The Effect of A Few Variables
murder <- murder %>%
mutate(Suicide = ifelse(MannerOfDeath == 2, 1, 0))
In each of the subsections below, whenever the \(p-value\) generated from the test is significant (less than 0.05), we can reject the null hypothesis that Suicide is independent of the variable being tested (i.e. that different values of the variable do not affect Suicide); otherwise we fail to reject the null hypothesis.
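The per-variable tests in the subsections below can also be run in a single loop. This is just a compact sketch on simulated data (the names `df`, `vars`, and `pvals` are illustrative); with the real data, `df` would simply be `murder`.

```r
set.seed(42)
# Simulated stand-in for the murder data: one binary outcome, two predictors.
df <- data.frame(Suicide = rbinom(300, 1, 0.5),
                 Sex = sample(c("M", "F"), 300, replace = TRUE),
                 Race = sample(1:4, 300, replace = TRUE))

# One chi-squared test of independence per candidate predictor.
vars <- c("Sex", "Race")
pvals <- sapply(vars, function(v) chisq.test(df[[v]], df$Suicide)$p.value)
pvals < 0.05  # TRUE = reject independence from Suicide at the 5% level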
2.1.1 Gender
Significant \(p-value\)
chisq.test(murder$Sex, murder$Suicide)
Pearson's Chi-squared test with Yates' continuity correction
data: murder$Sex and murder$Suicide
X-squared = 23.23, df = 1, p-value = 1.437e-06
2.1.2 Race
Significant \(p-value\).
chisq.test(murder$Race, murder$Suicide)
Pearson's Chi-squared test
data: murder$Race and murder$Suicide
X-squared = 15250, df = 3, p-value < 2.2e-16
2.1.3 Education
Significant \(p-value\).
chisq.test(murder$Education, murder$Suicide)
Pearson's Chi-squared test
data: murder$Education and murder$Suicide
X-squared = 3558.3, df = 8, p-value < 2.2e-16
2.1.4 Age
Significant \(p-value\).
chisq.test(murder$Age, murder$Suicide)
Pearson's Chi-squared test
data: murder$Age and murder$Suicide
X-squared = 6724.3, df = 42, p-value < 2.2e-16
2.1.5 Method Of Disposition
Significant \(p-value\).
chisq.test(murder$MethodOfDisposition, murder$Suicide)
Pearson's Chi-squared test
data: murder$MethodOfDisposition and murder$Suicide
X-squared = 4252.3, df = 5, p-value < 2.2e-16
3 Q3
3.1 Fitting A Model Using glm()
Some of the predictors can’t be viewed as numerical data such as PlaceOfDeathAndDecedentsStatus and Race. I have used as.factor() to account for this.
murder$PlaceOfDeathAndDecedentsStatus = as.factor(murder$PlaceOfDeathAndDecedentsStatus)
murder$Race = as.factor(murder$Race)
fit1 <- glm(data = murder,
Suicide ~ Education + Sex + Age +
PlaceOfDeathAndDecedentsStatus + MaritalStatus +
MethodOfDisposition + Race,
family = "binomial")
summary(fit1)
Call:
glm(formula = Suicide ~ Education + Sex + Age + PlaceOfDeathAndDecedentsStatus +
MaritalStatus + MethodOfDisposition + Race, family = "binomial",
data = murder)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9245 -0.4615 0.3807 0.6142 2.5920
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.943902 0.142377 -20.677 < 2e-16 ***
Education 0.184284 0.007228 25.495 < 2e-16 ***
SexM 0.314484 0.027638 11.379 < 2e-16 ***
Age 0.076176 0.003903 19.517 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus2 -0.411428 0.043396 -9.481 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus3 0.161772 0.085141 1.900 0.057427 .
PlaceOfDeathAndDecedentsStatus4 1.357213 0.038659 35.107 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus5 -0.725411 0.175044 -4.144 3.41e-05 ***
PlaceOfDeathAndDecedentsStatus6 -0.688298 0.166279 -4.139 3.48e-05 ***
PlaceOfDeathAndDecedentsStatus7 0.286811 0.037960 7.556 4.17e-14 ***
MaritalStatusM 0.220729 0.035756 6.173 6.69e-10 ***
MaritalStatusS -0.143932 0.036190 -3.977 6.98e-05 ***
MaritalStatusW -0.236247 0.061314 -3.853 0.000117 ***
MethodOfDispositionC 0.792471 0.024501 32.345 < 2e-16 ***
MethodOfDispositionD 1.211853 0.285413 4.246 2.18e-05 ***
MethodOfDispositionE 0.078582 0.162566 0.483 0.628822
MethodOfDispositionO -0.281884 0.161314 -1.747 0.080563 .
MethodOfDispositionR 0.039925 0.062053 0.643 0.519961
Race2 -2.141663 0.029028 -73.779 < 2e-16 ***
Race3 -0.398895 0.083023 -4.805 1.55e-06 ***
Race4 -0.152617 0.068629 -2.224 0.026162 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 70984 on 59800 degrees of freedom
Residual deviance: 48884 on 59780 degrees of freedom
AIC: 48926
Number of Fisher Scoring iterations: 5
For each data point the associated deviance is calculated, which results in a set of residuals. The first part of the output (Deviance Residuals: Min, 1Q, etc.) is simply a non-parametric description of the residuals' distribution.
Taking our model as a whole, the Residual deviance measures its lack of fit, whereas the Null deviance measures the lack of fit of a reduced model consisting only of the intercept.
Overall, the \(p-value\)s seem to be significant, although a few levels of some of the categorical variables (such as MethodOfDisposition) have high \(p-value\)s. However, it is best not to modify our predictors by deleting the levels with high \(p-value\)s and keeping the ones with low \(p-value\)s.
Let’s consider deleting MethodOfDisposition from our model(since 3 out of 5 levels have insignificant \(p-values\)):
fit2 <- glm(data = murder,
Suicide ~ Education + Sex + Age +
PlaceOfDeathAndDecedentsStatus + MaritalStatus + Race,
family = "binomial")
summary(fit2)
Call:
glm(formula = Suicide ~ Education + Sex + Age + PlaceOfDeathAndDecedentsStatus +
MaritalStatus + Race, family = "binomial", data = murder)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8105 -0.4762 0.3970 0.6308 2.5323
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.831326 0.141991 -19.940 < 2e-16 ***
Education 0.150195 0.006972 21.544 < 2e-16 ***
SexM 0.282516 0.027272 10.359 < 2e-16 ***
Age 0.092454 0.003883 23.808 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus2 -0.429922 0.042848 -10.034 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus3 0.133277 0.084283 1.581 0.114
PlaceOfDeathAndDecedentsStatus4 1.371779 0.038217 35.894 < 2e-16 ***
PlaceOfDeathAndDecedentsStatus5 -0.721002 0.173872 -4.147 3.37e-05 ***
PlaceOfDeathAndDecedentsStatus6 -0.699480 0.165010 -4.239 2.24e-05 ***
PlaceOfDeathAndDecedentsStatus7 0.315316 0.037470 8.415 < 2e-16 ***
MaritalStatusM 0.137952 0.035301 3.908 9.31e-05 ***
MaritalStatusS -0.193289 0.035748 -5.407 6.41e-08 ***
MaritalStatusW -0.361612 0.060610 -5.966 2.43e-09 ***
Race2 -2.299078 0.028545 -80.542 < 2e-16 ***
Race3 -0.641707 0.081656 -7.859 3.88e-15 ***
Race4 -0.177164 0.067829 -2.612 0.009 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 70984 on 59800 degrees of freedom
Residual deviance: 49997 on 59785 degrees of freedom
AIC: 50029
Number of Fisher Scoring iterations: 5
The AIC value turned out higher than in our previous model, and the second model also has a higher Residual deviance, therefore fit1 seems more suitable than fit2. (Note that AIC is uninformative when we only have one model; it can only be used to compare models.)
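A more formal way to compare the two nested models than eyeballing AIC and deviance is a likelihood-ratio test. The sketch below uses the built-in `mtcars` data so it is self-contained; on the real data, the call would be `anova(fit2, fit1, test = "Chisq")`.

```r
# Nested binomial GLMs on built-in data, standing in for fit2 (smaller)
# and fit1 (larger).
small <- glm(am ~ wt, data = mtcars, family = "binomial")
large <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# Likelihood-ratio test: a small p-value means the extra predictor
# significantly reduces the deviance, so the larger model is preferred.
anova(small, large, test = "Chisq")
```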
3.2 Testing How Well Our Model Fits
3.2.1 Hosmer-Lemeshow Goodness of Fit
The Hosmer-Lemeshow test is a goodness-of-fit test for logistic regression, used for binary response variables (suicide or not). The test's output consists of a Hosmer-Lemeshow chi-squared value and a \(p-value\) (\(H_0\): the fitted model is correct). Small \(p-value\)s mean that the model is a poor fit.
library(ResourceSelection)
hoslem.test(murder$Suicide, fitted(fit1))
Hosmer and Lemeshow goodness of fit (GOF) test
data: murder$Suicide, fitted(fit1)
X-squared = 64.351, df = 8, p-value = 6.484e-11
hoslem.test(murder$Suicide, fitted(fit2))
Hosmer and Lemeshow goodness of fit (GOF) test
data: murder$Suicide, fitted(fit2)
X-squared = 93.436, df = 8, p-value < 2.2e-16
3.2.2 Diagnostic Plots
The top-right plot is a check of normality. The points lie close to the dashed line, therefore we can conclude that our model satisfies this assumption.
The top-left plot is a check of linearity: the points should be close to the zero line and free of any pattern, which has clearly been violated in our model.
The bottom-right plot is used to detect influential observations: the points should all fall below the 0.05 threshold, and they are visibly lower, so our model has behaved well in this respect.
Overall, we can conclude that this model is not a good fit.
glm.diag.plots(fit1, glmdiag = glm.diag(fit1))
4 Q4
murder <- murder %>%
mutate(pred = fitted(fit1))
4.1 Density Plot
As you can see, the peaks of the two distributions are far apart, therefore our model has done well at distinguishing most of the suicides from the murders. However, towards the right side of the plot the densities overlap, where our model could not differentiate between the suicides and the murders.
ggplot(murder) +
geom_density(aes(x = pred, fill = as.factor(Suicide)), alpha = 0.5, size = 1) +
theme_minimal()
4.2 Histogram
We can see that the difference between the mean prediction for suicides and that for murders is large.
ggplot(murder) +
geom_histogram(aes(x = Suicide, y = mean(pred), fill = as.factor(Suicide)), stat = "identity") +
facet_grid(~Suicide) +
theme_minimal()
4.3 Scatter Plot + Line Graph
From the graph below, we can see that, in contrast to what was shown before, the model did not perform well.
ggplot() +
geom_line(aes(x = fit1$linear.predictors, y = fit1$fitted.values)) +
geom_point(aes(x = fit1$linear.predictors, y = murder$Suicide)) +
theme_minimal()
4.4 Scatter Plot
We can tell that our model performs well for high and low values of prediction, but not so much for the average values.
ggplot(murder) +
geom_point(aes(x = Age, y = pred, col = as.factor(Suicide)), size = 1, position = "jitter") +
theme_minimal()
5 Q5
5.1 Dividing The Data into Train and Test
murder <- na.omit(murder)
index = sample(x = 1:nrow(murder),
size = 0.8 * nrow(murder),
replace = F)
train = murder[index,]
test = murder[-index,]
model_glm <- glm(data = train,
Suicide ~ Education + Sex + Age +
PlaceOfDeathAndDecedentsStatus + MaritalStatus +
MethodOfDisposition + Race,
family = "binomial")
test$prediction = predict(model_glm, newdata = test, type = "response")
train$prediction = predict(model_glm, newdata = train, type = "response")
5.2 Calculating P, N, TP, …
P <- test %>%
filter(Suicide == 1) %>%
nrow()
P
[1] 8633
N <- test %>%
filter(Suicide == 0) %>%
nrow()
N
[1] 3328
TP <- test %>%
filter(Suicide == 1,
prediction >= .5) %>%
nrow()
TP
[1] 8110
TN <- test %>%
filter(Suicide == 0,
prediction < .5) %>%
nrow()
TN
[1] 1772
FP <- test %>%
filter(Suicide == 0,
prediction >= .5) %>%
nrow()
FP
[1] 1556
FN <- test %>%
filter(Suicide == 1,
prediction < .5) %>%
nrow()
FN
[1] 523
ACC <- (TP + TN) / (P + N)
ACC
[1] 0.8261851
FPR <- 1 - TN / N
FPR
[1] 0.4675481
TPR <- TP / P
TPR
[1] 0.9394185
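The counts above determine every other metric, so the same calculations can be restated in one block from the four confusion-matrix cells. The numbers are the ones reported above (they change with each random split); nothing here is new beyond the `precision` line.

```r
# Confusion-matrix counts reported above (they vary with the random split).
TP <- 8110; TN <- 1772; FP <- 1556; FN <- 523
P <- TP + FN   # all actual suicides
N <- TN + FP   # all actual murders

c(accuracy  = (TP + TN) / (P + N),
  TPR       = TP / P,          # sensitivity / recall
  FPR       = FP / N,          # 1 - specificity
  precision = TP / (TP + FP))
```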
5.3 Visualization of FP, FN, TP and TN
5.3.1 Confusion Matrix
cm_info = ConfusionMatrixInfo( data = test, predict = "prediction",
actual = "Suicide", cutoff = .5)
cm_info$plot
5.3.2 Table
table(test$Suicide, ifelse(test$prediction > 0.5, 1, 0)) %>% plot()
6 Q6
accuracy_info = AccuracyCutoffInfo(train = train, test = test,
predict = "prediction", actual = "Suicide" )
accuracy_info$plot
Since a different random train/test split is drawn each time the document is generated, I've written the code below to find the cutoff corresponding to the maximum accuracy.
dacc <- accuracy_info$data
dacc.max <- dacc %>%
mutate(trainmax = (train == max(train)),
testmax = (test == max(test)))
test.max <- dacc.max %>%
filter(testmax)
train.max <- dacc.max %>%
filter(trainmax)
new test cutoff:
test.cutoff <- test.max$cutoff
test.cutoff
[1] 0.55
new train cutoff:
train.cutoff <- train.max$cutoff
train.cutoff
[1] 0.5
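Without the `AccuracyCutoffInfo` helper, the accuracy-maximizing cutoff can also be found directly with a grid search. The sketch below runs on simulated predictions (all names here are illustrative); with the real data, substitute `test$Suicide` for `actual` and `test$prediction` for `pred`.

```r
set.seed(7)
# Simulated labels and predicted probabilities standing in for the test set.
actual <- rbinom(500, 1, 0.7)
pred   <- ifelse(actual == 1, rbeta(500, 4, 2), rbeta(500, 2, 4))

# Accuracy at each candidate cutoff; the best cutoff maximizes it.
cutoffs <- seq(0.05, 0.95, by = 0.05)
acc <- sapply(cutoffs, function(ct) mean((pred >= ct) == actual))
cutoffs[which.max(acc)]
```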
To better visualize the change in accuracy, I have once again drawn a confusion matrix and a table, this time with the (possibly) different cutoff.
cm_info = ConfusionMatrixInfo( data = test, predict = "prediction",
actual = "Suicide", cutoff = test.cutoff)
cm_info$plot
table(test$Suicide, ifelse(test$prediction > test.cutoff, 1, 0)) %>% plot()
7 Q7
7.1 ROC Curve
cost_fp = 100; cost_fn = 200
roc_info = ROCInfo(data = cm_info$data, predict = "predict",
actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn )
grid.draw(roc_info$plot)
The point where the bottom green part of the curve intersects the vertical dashed line shows the cutoff with minimum cost. (Note that with unequal FP/FN costs, this is not necessarily the cutoff with maximum accuracy.)
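The cost-based cutoff that the ROC plot highlights can be reproduced by hand: it is the cutoff minimizing `cost_fp * FP + cost_fn * FN`. This sketch uses the same simulated-predictions idea as before (names illustrative); with the real data, use `test$Suicide` and `test$prediction`.

```r
set.seed(7)
actual <- rbinom(500, 1, 0.7)
pred   <- ifelse(actual == 1, rbeta(500, 4, 2), rbeta(500, 2, 4))

cost_fp <- 100; cost_fn <- 200  # a false negative costs twice a false positive
cutoffs <- seq(0.05, 0.95, by = 0.05)
cost <- sapply(cutoffs, function(ct) {
  fp <- sum(pred >= ct & actual == 0)  # murders declared suicides
  fn <- sum(pred <  ct & actual == 1)  # suicides declared murders
  cost_fp * fp + cost_fn * fn
})
cutoffs[which.min(cost)]  # the cutoff the dashed line marks
```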
8 Using H2O
library(h2o)
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpuAaQGK/h2o_deyapple_started_from_r.out
/var/folders/mk/jrl557gd0qz741jsx86f3gjr0000gn/T//RtmpuAaQGK/h2o_deyapple_started_from_r.err
Starting H2O JVM and connecting: ... Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 5 seconds 18 milliseconds
H2O cluster timezone: Asia/Tehran
H2O data parsing timezone: UTC
H2O cluster version: 3.19.0.4257
H2O cluster version age: 19 days
H2O cluster name: H2O_started_from_R_deyapple_ste729
H2O cluster total nodes: 1
H2O cluster total memory: 1.78 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.3 (2017-11-30)
hmurder <- as.h2o(murder)
hglm = h2o.glm(y = "Suicide", x= c("Education","Sex","Age", "PlaceOfDeathAndDecedentsStatus", "MaritalStatus", "MethodOfDisposition",
"Race"),
training_frame = hmurder, family="binomial",nfolds = 5)
The MSE/RMSE isn’t really high, therefore we can conclude that our model isn’t completely off.
Log Loss quantifies the accuracy of a classifier by penalising false classifications (source). It is used in many competitions, such as those on Kaggle, and the goal is to minimize it. Since the LogLoss of our model is relatively high, we can conclude that our model does not have a high accuracy.
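For reference, log loss can be computed by hand from labels and predicted probabilities. The formula below is the standard binary cross-entropy that H2O reports; the `log_loss` helper is my own, not an H2O function.

```r
# Binary log loss (cross-entropy); probabilities are clipped so that
# log(0) never occurs.
log_loss <- function(actual, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(actual * log(prob) + (1 - actual) * log(1 - prob))
}

log_loss(c(1, 0, 1), c(0.9, 0.2, 0.8))  # low: confident, mostly correct
log_loss(c(1, 0, 1), c(0.5, 0.5, 0.5))  # coin-flip baseline: log(2) ≈ 0.693
```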
The AUC (Area Under Curve) is fairly high, which, in contrast, speaks in favor of the model’s ability to separate the two classes.
The \(R^2\) statistic is also low: only about 35% of the variance in our response is explained by the model.
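As a cross-check of that 35% figure, McFadden's pseudo-\(R^2\) can be computed from any glm fit's deviances; note that H2O reports a different \(R^2\) definition, so the two values need not match exactly. The `mtcars` example below is only for illustration.

```r
# McFadden's pseudo-R^2: share of the null deviance removed by the model.
mcfadden <- function(fit) 1 - fit$deviance / fit$null.deviance

fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
mcfadden(fit)  # between 0 and 1; higher = more deviance explained
```

Applied to fit1 from Q3, this gives 1 − 48884/70984 ≈ 0.31, in the same ballpark as the value H2O reports.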
Let’s take a look at the part labeled Maximum Metrics. The fourth metric gives information about the cutoff giving the maximum accuracy: the threshold column (0.530612) is the cutoff at which the maximum accuracy occurs, and this maximum value is shown in the value column (0.820756).
h2o.performance(hglm)
H2OBinomialMetrics: glm
** Reported on training data. **
MSE: 0.131186
RMSE: 0.3621961
LogLoss: 0.4203796
Mean Per-Class Error: 0.2910992
AUC: 0.8363014
Gini: 0.6726028
R^2: 0.3500981
Residual Deviance: 50278.24
AIC: 50302.24
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 7623 9156 0.545682 =9156/16779
1 1571 41451 0.036516 =1571/43022
Totals 9194 50607 0.179378 =10727/59801
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.395048 0.885431 272
2 max f2 0.209494 0.937819 328
3 max f0point5 0.721475 0.866406 160
4 max accuracy 0.530612 0.820756 231
5 max precision 0.981940 1.000000 0
6 max recall 0.055379 1.000000 388
7 max specificity 0.981940 1.000000 0
8 max absolute_mcc 0.554530 0.527019 223
9 max min_per_class_accuracy 0.750321 0.763935 150
10 max mean_per_class_accuracy 0.721475 0.766193 160
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
9 Q9
Overall, the model presented here is, for the various reasons stated in the previous sections, definitely not suitable for use in a real-life court.
The accuracy of our model (about 0.8) is not high enough to be reliable, and there is not enough information (predictors) to decide from.
The True Positive Rate (TPR) is very high, therefore our model behaves well in recognizing actual suicides, but it behaves poorly in recognizing murders: the number of True Negatives is low compared to the total number of negatives, and the False Positive Rate is high (about 0.47). This means that in almost one out of two cases, a death that was actually a murder will be declared a suicide, which is equivalent to setting a criminal free. The opposite error, declaring an actual suicide a murder, is equivalent to sending an innocent person to prison, and that error is much worse: in court it is much preferred to set a criminal free than to send an innocent person to prison. However, since it is impossible to decrease both errors to (even close to) zero, it is safer not to use our basic model in court, where it is a matter of life and death :). Nevertheless, it can be used to help find a better model with less error.